In this notebook, we are interested in exploring the use of miQC as an additional approach to filtering cells to remove any remaining cells that may not be viable or have low sequencing information.
Here, I am particularly comparing the use of miQC to using hard thresholds, in particular removing cells with greater than 20% mitochondrial content, less than 500 genes detected per cell, and less than 1000 total counts per cell.
For the most part, I followed the instructions in the miQC vignette for plotting and filtering.
I’m choosing to compare 2 single-cell samples (SCPCR000126, SCPCR000127) and 2 single-nuclei samples (SCPCR000118, SCPCR000119). We are particularly interested in how miQC will handle single-nuclei samples, as they should not contain many reads corresponding to mitochondrial genes.
library(ggpubr)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
# file paths to emptyDrops filtered sces from Alevin-fry
base_dir <- here::here()
cr_like_results_dir <- file.path(base_dir, "data", "results")
cr_like_sce_file <- file.path(cr_like_results_dir, "alevin-fry-cr-like-em-emptydrops-200-sces.rds")
cr_like_sce <- readr::read_rds(cr_like_sce_file)
# read in mito gene list
sample_info_dir <- file.path(base_dir, "sample-info")
mito_file <- file.path(sample_info_dir, "Homo_sapiens.GRCh38.103.mitogenes.txt")
mito_genes <- readr::read_tsv(mito_file, col_names = "gene_id")
Rows: 111 Columns: 1
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mito_genes <- mito_genes %>%
dplyr::pull(gene_id) %>%
unique()
Before we can do any modeling and perform any filtering, let’s look at the distribution of mitochondrial fraction per cell and unique genes detected per cell. The assumptions of miQC rely on having a distribution of mitochondrial reads and unique genes found. It uses a joint model to look at the proportion of reads mapping to mitochondrial DNA and the number of detected genes and determines the probability of a cell being compromised based on this proportion.
To start we have to calculate the per cell statistics using addPerCellQC().
# add per cell qc
cr_like_sce <- cr_like_sce %>%
purrr::map(scater::addPerCellQC, subsets = list(mito = mito_genes))
plot_miQC_metrics <- function(sce, sample_name){
p <- miQC::plotMetrics(sce) +
ggtitle(sample_name)
print(p)
}
purrr::iwalk(cr_like_sce, plot_miQC_metrics)
Already you can see that the plot shows a distribution of mitochondrial reads from 0-100% for the two single cell samples, but not for the single-nuclei samples. This could affect the ability of the model to distinguish compromised cells.
Now we can actually look at the miQC model and what cells will be considered compromised and filtered out. Here, I’m plotting the probabilities that each cell is compromised as calculated by miQC, which cells would be filtered out using a given cutoff, and then comparing it to if we were to filter our cells using a pre determind threshold.
# function to test miQC modeling with different parameters on sce objects and show plots
model_sce <- function(sce, title, model_type = "linear", posterior_cutoff = 0.75){
# plot distributions of total counts/cell, genes detected/cell and mito content
coldata_df <- data.frame(colData(sce))
manual_filter_plot <- ggplot(coldata_df, aes(x = sum, y = detected, color = subsets_mito_percent)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Total Count",
y = "Number of Genes Expressed",
color = "Mitochondrial\nFraction") +
geom_vline(xintercept = 500) +
geom_hline(yintercept = 1000) +
ggtitle(title) +
theme_classic()
# create model
sce_model <- miQC::mixtureModel(sce, model_type)
# plot cells colored by probability of cell being compromised or not
model_plot <- miQC::plotModel(sce, sce_model) +
theme_classic()
# plot cells that would be filtered using model vs cells that would be filtered using manual QC cutoffs
miQC_filter_plot <- miQC::plotFiltering(sce, sce_model, posterior_cutoff) +
geom_hline(yintercept = 20) +
geom_vline(xintercept = 500) +
theme_classic()
# arrange plots
grid.arrange(manual_filter_plot, model_plot, miQC_filter_plot, nrow = 2)
}
purrr::iwalk(cr_like_sce, model_sce)
Using the suggested parameters from miQC we see that it does a fairly good job of keeping cells with low mito content and high number of genes detected for the single-cell samples, except for a few cases in 126 and 127 where cells with low mito and high unique genes are being thrown out. For the single-nuclei samples however, it looks like the model is throwing away a lot more cells than we may want to probably due to the low range in mitochondrial percentages to begin with. Let’s see if we alter some of the parameters used, like the cutoff for filtering, or the model used to compute the probability of compromised cells, and if that impacts the output.
miQC will remove any cells that have a high probability of being compromised, by default these are cells with higher than 0.75 posterior probability. We can increase that threshold, which means that more cells will be kept and see if we are able to recover some of the cells that were being thrown out previously that we think should have been included (ie. cells with low mito that are being thrown out).
# change the posterior cutoff for miQC to keep throw away cells with > 0.9 probability of being compromised
purrr::iwalk(cr_like_sce, model_sce, posterior_cutoff = 0.9)
It doesn’t appear that increasing the probability threshold rescues very many more cells with low mito content in SCPCR000126 and SCPCR000127 that are being filtered out.
In addition to altering the threshold used for filtering, we can also test the use of different mixture models to calculate the posterior probability of a cell being compromised. We will test use of both the spline and polynomial modes rather than the default linear model. The polynomial model uses a single polynomial to model the entire datset, while spline uses a piecewise continuous function of many polynomials to model the data set.
# test spline mixture model rather than linear
purrr::iwalk(cr_like_sce, model_sce, model_type = "spline")
Use of the spline model definitely doesn’t seem ideal with one sample, SCPCR000126, showing the reverse of what we would expect and only keeping cells with low mito content. For the other samples we see a similar trend that we had observed in the linear model.
# test polynomial mixture model rather than linear
purrr::iwalk(cr_like_sce, model_sce, model_type = "polynomial")
For the linear model, we still see that cells with low mito are getting excluded from SCPCR000118 and SCPCR000119, but now see that same trend in SCPCR000220 and SCPCR000221 as well. It appears to model the cells with very high number of unique genes and low mito content as cells to exclude when we would rather keep those.
Now we can compare filtering using miQC vs. using the pre-determined cutoffs to see how the different filtering effects the overall number of cells that are removed. We can also look at the distribution of each of the metrics used for filtering, genes detected/cell, and mito content/cell to see how filtering affects the population of cells. Here we are using a threshold of 500 genes detected per cell and 10% mitochondrial reads.
# function to filter sce using miQC
miQC_filter <- function(sce, model_type = "linear", posterior_cutoff = 0.75){
sce_model <- miQC::mixtureModel(sce, model_type)
filtered_sce <- miQC::filterCells(sce, sce_model)
}
# filter using manual
manual_filter <- function(sce,
mito_cutoff = 20,
genes_detected_cutoff = 500) {
# get vector of cells to keep
cells_keep <- sce$detected > genes_detected_cutoff &
sce$subsets_mito_percent < mito_cutoff
# filter sce
filtered_sce <- sce[,cells_keep]
}
manual_filtered_sce <- purrr::map(cr_like_sce, manual_filter)
miQC_filtered_sce <- purrr::map(cr_like_sce, miQC_filter)
Removing 1197 out of 1612 cells.Removing 1793 out of 6520 cells.Removing 951 out of 15870 cells.Removing 1162 out of 15673 cells.Removing 1616 out of 5032 cells.Removing 1683 out of 5306 cells.
manual_filtered_sce_10 <- purrr::map(cr_like_sce, manual_filter, mito_cutoff = 10)
manual_filtered_sce_mito_only <- purrr::map(cr_like_sce, manual_filter, genes_detected_cutoff = 0)
# function to grab number of cells from sce
get_num_cells <- function(sce){
num_cells <- dim(sce)[2]
}
manual_cell_num <- manual_filtered_sce %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(manual_cell_num) <- c("manual_mito_20")
manual_cell_num_10 <- manual_filtered_sce_10 %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(manual_cell_num_10) <- c("manual_mito_10")
manual_cell_num_mito_only <- manual_filtered_sce_mito_only %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(manual_cell_num_mito_only) <- c("manual_mito_only")
miQC_cell_num <- miQC_filtered_sce %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(miQC_cell_num) <- c("miQC")
pre_filter_cell_num <- cr_like_sce %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(pre_filter_cell_num) <- c("pre_filtering")
# create dataframe for plotting with number of cells per sce
combined_cell_num_df <- dplyr::bind_rows(manual_cell_num,
miQC_cell_num,
manual_cell_num_10,
manual_cell_num_mito_only,
pre_filter_cell_num) %>%
tibble::rownames_to_column("filtering_method") %>%
tidyr::pivot_longer(cols = starts_with("SCPCR"),
names_to = "sample",
values_to = "number_cells")
ggplot(combined_cell_num_df, aes(x = sample, y = number_cells, fill = filtering_method)) +
geom_col(position = "dodge") +
theme(axis.text.x = element_text(angle = 90))
It looks like manual filtering (with these specific cutoff combinations) are removing more cells than miQC for single-cell, but more cells are removed in single-nuclei samples with either miQC than manual.
Suggested next steps include comparing PCA or UMAP results using the different filtering thresholds to identify if any clusters corresponding to “bad cells” remain. Additionally, we will need to identify if we want to add any additional criteria besides the posterior probability computed by miQC to be included as a “keep” cell in the ccdl_suggests column that will be appended to the colData for every filtered sce object.
Since we are interested in providing a ccdl_suggests column, we need to identify the optimal criteria to be included in that filtering. Here, we will first add the posterior probability column to the sce and then filter the sce using the following combinations:
For these tests we are using the default posterior probability for
To test whether or not we have removed compromised cells, we will look at two types of plots:
We anticipate that if we have removed cells that are compromised then the PCA plots should not show any clustering based on technical artifacts, i.e. number of genes detected or mitochondrial content.
## add posterior probability to colData of sces
add_probability <- function(sce) {
model <- miQC::mixtureModel(sce)
sce <- miQC::filterCells(sce, model, posterior_cutoff = 1, enforce_left_cutoff = FALSE, verbose = FALSE)
}
cr_like_sce <- cr_like_sce %>%
purrr::map(add_probability)
# function to test different combinations of miQC with/without mito cutoff and genes detected cutoff
# outputs PCA plots colored by genes detected and mito content and plot showing distributions of UMI/cell, genes/cell after filtering
test_filtering <- function(sce,title, mito_cutoff = 100, genes_detected_cutoff = 0 ){
# filter sce
cells <- (sce$prob_compromised < 0.75 | sce$subsets_mito_percent < mito_cutoff) &
sce$detected > genes_detected_cutoff
sce <- sce[,cells]
# plot distribution of UMI vs. genes detected
coldata_df <- data.frame(colData(sce))
distribution_plot <- ggplot(coldata_df, aes(x = sum, y = detected, color = subsets_mito_percent)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Total Count",
y = "Number of Genes Expressed",
color = "Mitochondrial\nFraction") +
ggtitle(title) +
theme_classic()
# normalize and transform counts
qclust <- scran::quickCluster(sce)
sce <- scran::computeSumFactors(sce, clusters = qclust)
sce <- scater::logNormCounts(sce)
# model gene variance and select highly variable genes
gene_variance <- scran::modelGeneVar(sce)
subset_genes <- scran::getTopHVGs(gene_variance, n = 2000)
# run PCA using those genes
sce <- scater::runPCA(sce, subset_row = subset_genes)
# make PCA plots
gene_detected_pca <- scater::plotReducedDim(sce, dimred = "PCA", colour_by = "detected")
mito_pca <- scater::plotReducedDim(sce, dimred = "PCA", colour_by = "subsets_mito_percent")
# arrange plots
grid.arrange(distribution_plot, gene_detected_pca, mito_pca, nrow = 2)
}
# miQC pass only, set mito cutoff to 0 so only miQC pass is considered
purrr::iwalk(cr_like_sce, test_filtering, mito_cutoff = 0)
# miQC pass or below 10 % mito
purrr::iwalk(cr_like_sce, test_filtering, mito_cutoff = 10)
# miQC pass or below 10 % mito or above 100 genes detected
purrr::iwalk(cr_like_sce, test_filtering, mito_cutoff = 10, genes_detected_cutoff = 100)
# miQC pass or below 10 % mito or above 200 genes detected
purrr::iwalk(cr_like_sce, test_filtering, mito_cutoff = 10, genes_detected_cutoff = 200)
# miQC pass or below 10 % mito or above 500 genes detected
purrr::iwalk(cr_like_sce, test_filtering, mito_cutoff = 10, genes_detected_cutoff = 500)
sessioninfo::session_info()
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.0.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
beachmat 2.6.4 2020-12-20 [1] Bioconductor
beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.0.2)
Biobase * 2.50.0 2020-10-27 [1] Bioconductor
BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
BiocNeighbors 1.8.2 2020-12-07 [1] Bioconductor
BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
BiocSingular 1.6.0 2020-10-27 [1] Bioconductor
bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.2)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.2)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.0.2)
bluster 1.0.0 2020-10-27 [1] Bioconductor
broom 0.7.9 2021-07-27 [1] CRAN (R 4.0.5)
bslib 0.3.0 2021-09-02 [1] CRAN (R 4.0.2)
car 3.0-11 2021-06-27 [1] CRAN (R 4.0.2)
carData 3.0-4 2020-05-22 [1] CRAN (R 4.0.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.2)
cli 3.0.1 2021-07-17 [1] CRAN (R 4.0.2)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.0.2)
cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.0.2)
crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
curl 4.3.2 2021-06-23 [1] CRAN (R 4.0.2)
data.table 1.14.0 2021-02-21 [1] CRAN (R 4.0.2)
DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
DelayedArray 0.16.3 2021-03-24 [1] Bioconductor
DelayedMatrixStats 1.12.3 2021-02-03 [1] Bioconductor
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.0.2)
dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.0.2)
DropletUtils 1.13.2 2021-08-30 [1] Bioconductor
edgeR 3.32.1 2021-01-14 [1] Bioconductor
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.2)
farver 2.1.0 2021-02-28 [1] CRAN (R 4.0.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2)
flexmix * 2.3-17 2020-10-12 [1] CRAN (R 4.0.2)
forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
foreign 0.8-81 2020-12-22 [1] CRAN (R 4.0.5)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
GenomeInfoDb * 1.26.7 2021-04-08 [1] Bioconductor
GenomeInfoDbData 1.2.4 2021-04-09 [1] Bioconductor
GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.2)
ggforce 0.3.3 2021-03-05 [1] CRAN (R 4.0.2)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.0.2)
ggpubr * 0.4.0 2020-06-27 [1] CRAN (R 4.0.2)
ggsignif 0.6.3 2021-09-09 [1] CRAN (R 4.0.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.0.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
haven 2.4.3 2021-08-04 [1] CRAN (R 4.0.2)
HDF5Array 1.18.1 2021-02-04 [1] Bioconductor
here 1.0.1 2020-12-13 [1] CRAN (R 4.0.2)
hms 1.1.0 2021-05-17 [1] CRAN (R 4.0.2)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.0.5)
igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.2)
IRanges * 2.24.1 2020-12-12 [1] Bioconductor
irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.2)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.2)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
knitr 1.34 2021-09-09 [1] CRAN (R 4.0.2)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
lattice * 0.20-44 2021-05-02 [1] CRAN (R 4.0.2)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
limma 3.46.0 2020-10-27 [1] Bioconductor
locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.0.2)
magrittr * 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
MASS 7.3-54 2021-05-03 [1] CRAN (R 4.0.2)
Matrix 1.3-4 2021-06-01 [1] CRAN (R 4.0.2)
MatrixGenerics * 1.2.1 2021-01-30 [1] Bioconductor
matrixStats * 0.60.1 2021-08-23 [1] CRAN (R 4.0.2)
miQC 1.1.5 2021-09-14 [1] Github (greenelab/miQC@0eab0b6)
modeltools 0.2-23 2020-03-05 [1] CRAN (R 4.0.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
nnet 7.3-16 2021-05-03 [1] CRAN (R 4.0.2)
openxlsx 4.2.4 2021-06-16 [1] CRAN (R 4.0.2)
pillar 1.6.2 2021-07-29 [1] CRAN (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.0.2)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.0.2)
Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.0.2)
RCurl 1.98-1.4 2021-08-17 [1] CRAN (R 4.0.2)
readr 2.0.1 2021-08-10 [1] CRAN (R 4.0.2)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
remotes 2.4.0 2021-06-02 [1] CRAN (R 4.0.2)
rhdf5 2.34.0 2020-10-27 [1] Bioconductor
rhdf5filters 1.2.1 2021-05-03 [1] Bioconductor
Rhdf5lib 1.12.1 2021-01-26 [1] Bioconductor
rio 0.5.27 2021-06-21 [1] CRAN (R 4.0.2)
rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.0.5)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
rstatix 0.7.0 2021-02-13 [1] CRAN (R 4.0.2)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.0.2)
S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
sass 0.4.0 2021-05-12 [1] CRAN (R 4.0.5)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
scater 1.18.6 2021-02-26 [1] Bioconductor
scpcaTools * 0.1.0 2021-09-14 [1] local
scran 1.18.7 2021-04-16 [1] Bioconductor
scuttle 1.0.4 2020-12-17 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
SingleCellExperiment * 1.12.0 2020-10-27 [1] Bioconductor
sparseMatrixStats 1.2.1 2021-02-02 [1] Bioconductor
statmod 1.4.36 2021-05-10 [1] CRAN (R 4.0.2)
stringi 1.7.4 2021-08-25 [1] CRAN (R 4.0.5)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
tibble 3.1.4 2021-08-25 [1] CRAN (R 4.0.5)
tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.2)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.2)
tinytex 0.33 2021-08-05 [1] CRAN (R 4.0.2)
tweenr 1.0.2 2021-03-23 [1] CRAN (R 4.0.2)
tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.0.5)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.0.2)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.2)
viridis 0.6.1 2021-05-11 [1] CRAN (R 4.0.2)
viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.0.2)
vroom 1.5.5 2021-09-14 [1] CRAN (R 4.0.5)
withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5)
xfun 0.26 2021-09-14 [1] CRAN (R 4.0.5)
XVector 0.30.0 2020-10-28 [1] Bioconductor
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
zip 2.2.0 2021-05-31 [1] CRAN (R 4.0.2)
zlibbioc 1.36.0 2020-10-28 [1] Bioconductor
[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library